Choosing the Right Estimator - Scikit-Learn

Import Dependencies

In [1]:
import os
import pandas as pd
import numpy as np
from flask import jsonify
import seaborn as sns
import matplotlib.pyplot as plt
#To visualize the whole grid
pd.options.display.max_columns = 999

Train & Test File CleanUp

In [2]:
# Read both train and test data
train = pd.read_csv("../../data/train.csv")
test = pd.read_csv("../../data/test.csv")
In [4]:
# Because Product 3 and 2 have missing values, calculate the % of missing values
null_values_train = null_values_train[null_values_train >0]/train.shape[0]
null_values_test = null_values_test[null_values_test >0]/test.shape[0]
with open('missing_values.html', "a") as missing:
    print('--------------Train % -----------', file = missing)
    print(f'{null_values_train*100} %', file = missing)
    print('--------------Test % -----------', file = missing)
    print(f'{null_values_test*100} %', file = missing)   

The output of the missing values in Product Cat 3 & 2 showed us that Product Cat 3 has 69.6% of missing values and Product Cat 2 has 31.5% of missing values. Hence, we can assume that for all missing values (NA), people actually did not buy those products categories and we will fill those NA values with zeros.

In [5]:
# Convert all null values to 0. It is assumed that the if it is null value, the customer did not buy such product
train = train.fillna(0)
test =  test.fillna(0)
In [7]:
# Lets drop User ID and Product ID
drop_columns = ['User_ID', 'Product_ID']
train.drop(drop_columns, inplace = True, axis =1)
test.drop(drop_columns, inplace = True, axis =1)
test
Out[7]:
Gender Age Occupation City_Category Stay_In_Current_City_Years Marital_Status Product_Category_1 Product_Category_2 Product_Category_3
0 M 46-50 7 B 2 1 1 11.0 0.0
1 M 26-35 17 C 0 0 3 5.0 0.0
2 F 36-45 1 B 4+ 1 5 14.0 0.0
3 F 36-45 1 B 4+ 1 4 9.0 0.0
4 F 26-35 1 C 1 0 4 5.0 12.0
... ... ... ... ... ... ... ... ... ...
233594 F 26-35 15 B 4+ 1 8 0.0 0.0
233595 F 26-35 15 B 4+ 1 5 8.0 0.0
233596 F 26-35 15 B 4+ 1 1 5.0 12.0
233597 F 46-50 1 C 4+ 0 10 16.0 0.0
233598 F 46-50 0 B 4+ 1 4 5.0 0.0

233599 rows × 9 columns

Convert features to ordinal/discrete values

We are converting features to ordinal/discrete values in order to execute the model.

In [8]:
# Need to convert all objects to numeric. Let's start with Stay_In_Current_City_Years
train['Stay_In_Current_City_Years'].unique()

# # Convert 4+ to 4

train['Stay_In_Current_City_Years'] = train['Stay_In_Current_City_Years'].map({'4+':4, '1':1, '2':2, '3':3,'0':0})
test['Stay_In_Current_City_Years'] = test['Stay_In_Current_City_Years'].map({'4+':4, '1':1, '2':2, '3':3,'0':0})

train['Stay_In_Current_City_Years'].unique()
test
Out[8]:
Gender Age Occupation City_Category Stay_In_Current_City_Years Marital_Status Product_Category_1 Product_Category_2 Product_Category_3
0 M 46-50 7 B 2 1 1 11.0 0.0
1 M 26-35 17 C 0 0 3 5.0 0.0
2 F 36-45 1 B 4 1 5 14.0 0.0
3 F 36-45 1 B 4 1 4 9.0 0.0
4 F 26-35 1 C 1 0 4 5.0 12.0
... ... ... ... ... ... ... ... ... ...
233594 F 26-35 15 B 4 1 8 0.0 0.0
233595 F 26-35 15 B 4 1 5 8.0 0.0
233596 F 26-35 15 B 4 1 1 5.0 12.0
233597 F 46-50 1 C 4 0 10 16.0 0.0
233598 F 46-50 0 B 4 1 4 5.0 0.0

233599 rows × 9 columns

In [9]:
# Need to convert all objects to numeric.

train['City_Category'].unique()
train['City_Category'] = train['City_Category'].map({'A':0, 'B':1, 'C':2})
test['City_Category'] = test['City_Category'].map({'A':0, 'B':1, 'C':2})
train['City_Category'].unique()
Out[9]:
array([0, 2, 1], dtype=int64)
In [10]:
# Need to convert all objects to numeric.

train['Age'].unique()
train['Age'] = train['Age'].map({'0-17':0, '18-25':1, '26-35':2, '36-45':3, '46-50':4, '51-55':5, '55+':6})
test['Age'] = test['Age'].map({'0-17':0, '18-25':1, '26-35':2, '36-45':3, '46-50':4, '51-55':5, '55+':6})
train['Age'].unique()
Out[10]:
array([0, 6, 2, 4, 5, 3, 1], dtype=int64)
In [11]:
# Need to convert all objects to numeric.

train['Gender'].unique()
train['Gender'] = train['Gender'].map({'F':0, 'M':1})
test['Gender'] = test['Gender'].map({'F':0, 'M':1})
train['Gender'].unique()
Out[11]:
array([0, 1], dtype=int64)

Correlation Matrix

The correlation matrix is helping us to understand the strenght of the linear relations between two variables. Ideally we want to understand what is the relation between the dependent variable (Purchase) and all other independent variables. The correlation heatmap shows us that the dependent variable Purchase has correlation with the following independent variables: Marital Status, Age, and Product_Cat 3.

Why only the Product Cat 3 has a linear relation with the dependent variable? See 3.3 Does this means that these three independent variables have the most influence in the output of the dependent variable?
To be determined

In [12]:
# Create a correlation matrix to find out which independent variable is correlated with the dependent variable
correlation_matrix = train.corr()
fig,ax = plt.subplots(figsize = (12,9))
sns.heatmap(correlation_matrix, vmax=0.8, cmap="coolwarm", square=True, annot = True)
plt.savefig("../images/corr_matrix.png", bbox_inches = "tight")

Products Categories analysis

In [13]:
# Why product category 3 is correlated with dependent variable purchase and the others are not?
cat_1_average = train['Product_Category_1'].mean()
cat_2_average = train['Product_Category_2'].mean()
cat_3_avarage= train['Product_Category_3'].mean()
print(f"PC1: {cat_1_average} \n PC2: {cat_2_average} \n PC3 : {cat_3_avarage} \n Hence we can conclude the correlation of PC3 is due to average price. PC3 has on average the cheaper price")
PC1: 5.404270017525106 
 PC2: 6.735436346051761 
 PC3 : 3.841941359977312 
 Hence we can conclude the correlation of PC3 is due to average price. PC3 has on average the cheaper price

Data Preparation for Model

In [14]:
# Remove category 19 and 20 fron Product Category 1 from Train & Test
removal_train = train.index[(train.Product_Category_1.isin([19,20]))]
train = train.drop(removal_train)
# Remove category 19 and 20 fron Product Category 1 from Train & Test
removal_test = test.index[(test.Product_Category_1.isin([19,20]))]
test = test.drop(removal_test)
In [17]:
# Check data types
train.dtypes
Out[17]:
Gender                          int64
Age                             int64
Occupation                      int64
City_Category                   int64
Stay_In_Current_City_Years      int64
Marital_Status                  int64
Product_Category_1              int64
Product_Category_2            float64
Product_Category_3            float64
Purchase                        int64
dtype: object

Save clean files

In [18]:
# Save clean file
train.to_csv("../../data/trainCLEAN.csv", index=False, encoding='utf8')
test.to_csv("../../data/testCLEAN.csv", index=False, encoding='utf8')

Linear Regression Model Non-Scaled

In [19]:
train.columns
Out[19]:
Index(['Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3', 'Purchase'],
      dtype='object')

Assing the features to X and y. Where X are the independent variables and y is the target value

In [20]:
X = train[['Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3']]
y = train['Purchase'].values.reshape(-1, 1)
print(X.shape, y.shape)
(545915, 9) (545915, 1)

Train Test Split Data

In [21]:
# Split the data into training and testing 80% train and 20% test
### BEGIN SOLUTION
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42) 
### END SOLUTION
In [22]:
### BEGIN SOLUTION
from sklearn.linear_model import LinearRegression
model = LinearRegression(n_jobs = 1)
model.fit(X_train, y_train)
### END SOLUTION
Out[22]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Model Predictions & Residual Plot

Run the predictions with .predict(X_test) and plot a residual plot. If the points in the residual plot are randomly disperse around the horizontal axis, a linear regression model is appropriate for the data. Otherwise, a non-linear model is more appropriate

In [23]:
# Make predictions using the X_test data
# Plot y_test vs y_test
# Scatter plot y_tes_scaled vs predictions

### BEGIN SOLUTION
predictions = model.predict(X_test)

plt.scatter(model.predict(X_train), model.predict(X_train) - y_train, c="blue", label="Training Data")
plt.scatter(model.predict(X_test), model.predict(X_test) - y_test, c="orange", label="Testing Data")
plt.legend()
plt.hlines(y=0, xmin=y_test.min(), xmax=y_test.max())
plt.title("Residual Plot")
plt.show()
### END SOLUTION
In [24]:
linear_predictions = model.predict(X_test)
print("Predicted purchases (in dollars) for new costumers:", linear_predictions)
Predicted purchases (in dollars) for new costumers: [[8315.10636055]
 [8115.38982673]
 [9794.48974187]
 ...
 [8518.44545858]
 [8836.88427431]
 [8058.01384862]]

Model evaluation

To evaluate the model we will use the MSE (Mean Square Error), RMSE (Root Mean Squared Error) and R2

MSE: Measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. The MSE is a measure of the quality of an estimator—it is always non-negative, and values closer to zero are better.

RMSE: Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. RMSE is always non-negative, and a value of 0 (almost never achieved in practice) would indicate a perfect fit to the data. In general, a lower RMSD is better than a higher one. However, comparisons across different types of data would be invalid because the measure is dependent on the scale of the numbers used.

R2: R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. In general, the higher the R-squared, the better the model fits your data.

R-squared as a Biased Estimate R-squared measures the strength of the relationship between the predictors and response. The R-squared in your regression output is a biased estimate based on your sample.

  • An unbiased estimate is one that is just as likely to be too high as it is to be too low, and it is correct on average. If you collect a random sample correctly, the sample mean is an unbiased estimate of the population mean.
  • A biased estimate is systematically too high or low, and so is the average. It’s like a bathroom scale that always indicates you are heavier than you really are. No one wants that!

R-squared is like the broken bathroom scale: it is deceptively large. Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.

This bias is a reason why some practitioners don’t use R-squared at all—it tends to be wrong.

Fortunately, there is a solution and you’re probably already familiar with it: adjusted R-squared. I’ve written about using the adjusted R-squared to compare regression models with a different number of terms.

In [25]:
# Used X_test, y_test, and model.predict(X_test) to calculate MSE and R2

### BEGIN SOLUTION

from sklearn.metrics import mean_squared_error

MSE = mean_squared_error(y_test, predictions)
r2 = model.score(X_test, y_test)
r2_adjusted = 1 - (1-r2)*(len(y)-1)/(len(y)-X.shape[1]-1)
RMSE = np.sqrt(MSE)
### END SOLUTION

print(f"MSE: {MSE}, RMSE: {RMSE}, R2: {r2}, R2_Adjusted: {r2_adjusted}")
MSE: 21447788.690087605, RMSE: 4631.175735176501, R2: 0.13530769254395825, R2_Adjusted: 0.13529343689367634

Regression Coeffiecients

In [26]:
coeff_df = pd.DataFrame(model.coef_[0], X_train.columns, columns=['Coefficient'])  
coeff_df
Out[26]:
Coefficient
Gender 483.447671
Age 102.735701
Occupation 5.853371
City_Category 333.854684
Stay_In_Current_City_Years 9.343869
Marital_Status -50.043793
Product_Category_1 -319.058835
Product_Category_2 8.675524
Product_Category_3 147.523879
In [27]:
ax = coeff_df.plot(kind='bar', title='Variable Coefficients')
ax.grid(zorder=0)

Coefficients Correlation heatmap

In [28]:
import seaborn as sns
corr = X.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

LASSO Model

https://www.statisticshowto.datasciencecentral.com/lasso-regression/

Lasso (least absolute shrinkage and selection operator) regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited for models showing high levels of muticollinearity or when you want to automate certain parts of model selection, like variable selection/parameter elimination.

Multicollinearity

Multicollinearity generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant information, skewing the results in a regression model. Examples of correlated predictor variables (also called multicollinear predictors) are: a person’s height and weight, age and sales price of a car, or years of education and annual income. An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs of predictor variables. If the correlation coefficient, r, is exactly +1 or -1, this is called perfect multicollinearity. If r is close to or exactly -1 or +1, one of the variables should be removed from the model if at all possible.

It’s more common for multicollineariy to rear its ugly head in observational studies; it’s less common with experimental data. When the condition is present, it can result in unstable and unreliable regression estimates. Several other problems can interfere with analysis of results, including:

  • The t-statistic will generally be very small and coefficient confidence intervals will be very wide. This means that it is harder to reject the null hypothesis.
  • The partial regression coefficient may be an imprecise estimate; standard errors may be very large.
  • Partial regression coefficients may have sign and/or magnitude changes as they pass from sample to sample.
  • Multicollinearity makes it difficult to gauge the effect of independent variables on dependent variables.
In [29]:
# LASSO model
# Note: Use an alpha of .01 when creating the model for this activity
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=.01).fit(X_train, y_train)

lasso_predictions = lasso.predict(X_test)

MSE = mean_squared_error(y_test, lasso_predictions)
r2 = lasso.score(X_test, y_test)
RMSE = np.sqrt(MSE)
### END SOLUTION

print(f"MSE: {MSE}, RMSE: {RMSE}, R2: {r2}")
MSE: 21447789.082953833, RMSE: 4631.175777591889, R2: 0.13530767670510369
In [30]:
lasso_df = pd.DataFrame(lasso.coef_, X_train.columns, columns=['Coefficient'])  
lasso_df
Out[30]:
Coefficient
Gender 483.393670
Age 102.725619
Occupation 5.853760
City_Category 333.838240
Stay_In_Current_City_Years 9.338470
Marital_Status -49.993610
Product_Category_1 -319.058545
Product_Category_2 8.675363
Product_Category_3 147.523922
In [31]:
ax = lasso_df.plot(kind='bar', title='Variable Coefficients')
ax.grid(zorder=0)

Ridge Model

https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. It is hoped that the net effect will be to give estimates that are more reliable.

In [32]:
# Ridge model
# Note: Use an alpha of .01 when creating the model for this activity
from sklearn.linear_model import Ridge

### BEGIN SOLUTION
ridge = Ridge(alpha=.01).fit(X_train, y_train)

ridge_predictions = ridge.predict(X_test)

MSE = mean_squared_error(y_test, ridge_predictions)
r2 = ridge.score(X_test, y_test)
RMSE = np.sqrt(MSE)
### END SOLUTION

print(f"MSE: {MSE}, RMSE: {RMSE}, R2: {r2}")
MSE: 21447788.690393727, RMSE: 4631.175735209551, R2: 0.13530769253161656
In [33]:
ridge_df = pd.DataFrame(ridge.coef_[0], X_train.columns, columns=['Coefficient'])  
ridge_df
Out[33]:
Coefficient
Gender 483.447610
Age 102.735699
Occupation 5.853372
City_Category 333.854670
Stay_In_Current_City_Years 9.343869
Marital_Status -50.043787
Product_Category_1 -319.058835
Product_Category_2 8.675524
Product_Category_3 147.523879
In [34]:
ax = ridge_df.plot(kind='bar', title='Variable Coefficients')
ax.grid(zorder=0)

ElasticNet Model

https://hackernoon.com/an-introduction-to-ridge-lasso-and-elastic-net-regression-cca60b4b934f

In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.

In [35]:
# ElasticNet model
# Note: Use an alpha of .01 when creating the model for this activity
from sklearn.linear_model import ElasticNet

### BEGIN SOLUTION
elasticnet = ElasticNet(alpha=.01).fit(X_train, y_train)

elasticnet_predictions = elasticnet.predict(X_test)

MSE = mean_squared_error(y_test, elasticnet_predictions)
RMSE = np.sqrt(MSE)
### END SOLUTION

print(f"MSE: {MSE}, RMSE: {RMSE}, R2: {r2}")
MSE: 21447890.682992008, RMSE: 4631.186746719679, R2: 0.13530769253161656
In [36]:
elasticnet_df = pd.DataFrame(elasticnet.coef_, X_train.columns, columns=['Coefficient'])  
elasticnet_df
Out[36]:
Coefficient
Gender 470.540282
Age 102.423529
Occupation 5.967020
City_Category 330.953306
Stay_In_Current_City_Years 9.400863
Marital_Status -48.755231
Product_Category_1 -318.998347
Product_Category_2 8.678921
Product_Category_3 147.563751
In [37]:
ax = elasticnet_df.plot(kind='bar', title='Variable Coefficients')
ax.grid(zorder=0)

Regression Model - Scaled & Get Dummies

In [38]:
# Read both train and test data
train = pd.read_csv("../../data/train.csv")
test = pd.read_csv("../../data/test.csv")

# Check for missing values in the train and test file 
null_values_train = train.isnull().sum().sort_values(ascending = False)
null_values_train

null_values_test = test.isnull().sum().sort_values(ascending = False)
null_values_test

# Convert all null values to 0. It is assumed that the if it is null value, the customer did not buy such product
train = train.fillna(0)
test =  test.fillna(0)
In [39]:
# Lets drop User ID and Product ID
drop_columns = ['User_ID', 'Product_ID']
train.drop(drop_columns, inplace = True, axis =1)
# test.drop(drop_columns, inplace = True, axis =1)
In [40]:
# Remove category 19 and 20 fron Product Category 1 from Train
removal = train.index[(train.Product_Category_1.isin([19,20]))]
train = train.drop(removal)
train
Out[40]:
Gender Age Occupation City_Category Stay_In_Current_City_Years Marital_Status Product_Category_1 Product_Category_2 Product_Category_3 Purchase
0 F 0-17 10 A 2 0 3 0.0 0.0 8370
1 F 0-17 10 A 2 0 1 6.0 14.0 15200
2 F 0-17 10 A 2 0 12 0.0 0.0 1422
3 F 0-17 10 A 2 0 12 14.0 0.0 1057
4 M 55+ 16 C 4+ 0 8 0.0 0.0 7969
... ... ... ... ... ... ... ... ... ... ...
545910 M 26-35 6 B 2 0 8 0.0 0.0 9855
545911 M 26-35 6 B 2 0 5 0.0 0.0 1962
545912 M 26-35 6 B 2 0 8 0.0 0.0 7852
545913 M 26-35 6 B 2 0 5 0.0 0.0 7159
545914 M 26-35 6 B 2 0 1 2.0 11.0 11640

545915 rows × 10 columns

In [41]:
train = pd.get_dummies(train)
train.head()
Out[41]:
Occupation Marital_Status Product_Category_1 Product_Category_2 Product_Category_3 Purchase Gender_F Gender_M Age_0-17 Age_18-25 Age_26-35 Age_36-45 Age_46-50 Age_51-55 Age_55+ City_Category_A City_Category_B City_Category_C Stay_In_Current_City_Years_0 Stay_In_Current_City_Years_1 Stay_In_Current_City_Years_2 Stay_In_Current_City_Years_3 Stay_In_Current_City_Years_4+
0 10 0 3 0.0 0.0 8370 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0
1 10 0 1 6.0 14.0 15200 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0
2 10 0 12 0.0 0.0 1422 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0
3 10 0 12 14.0 0.0 1057 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0
4 16 0 8 0.0 0.0 7969 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1
In [43]:
X = train[['Occupation', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3', 'Gender_F',
       'Gender_M', 'Age_0-17', 'Age_18-25', 'Age_26-35', 'Age_36-45',
       'Age_46-50', 'Age_51-55', 'Age_55+', 'City_Category_A',
       'City_Category_B', 'City_Category_C', 'Stay_In_Current_City_Years_0',
       'Stay_In_Current_City_Years_1', 'Stay_In_Current_City_Years_2',
       'Stay_In_Current_City_Years_3', 'Stay_In_Current_City_Years_4+']]
y = train['Purchase'].values.reshape(-1, 1)
print(X.shape, y.shape)
(545915, 22) (545915, 1)

Train Test Split Data

In [44]:
# Split the data into training and testing

### BEGIN SOLUTION
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42) # need to add stratify = y but I get an error
### END SOLUTION

StandardScaler

In [45]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler model and fit it to the training data

### BEGIN SOLUTION
X_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)
## END SOLUTION
In [46]:
# Transform the training and testing data using the X_scaler and y_scaler models

### BEGIN SOLUTION
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
### END SOLUTION

Multivariate Linear Regression

In [47]:
### BEGIN SOLUTION
from sklearn.linear_model import LinearRegression
model = LinearRegression(n_jobs = 1)
model.fit(X_train_scaled, y_train_scaled)
### END SOLUTION
Out[47]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Model Predictions & Residual Plot

Run the predictions with .predict(X_test) and plot a residual plot. If the points in the residual plot are randomly disperse around the horizontal axis, a linear regression model is appropriate for the data. Otherwise, a non-linear model is more appropriate

In [48]:
# Make predictions using the X_test_scaled data
# Plot y_test_scaled vs y_test_scaled
# Scatter plot y_test_scaled vs predictions

### BEGIN SOLUTION
predictions = model.predict(X_test_scaled)

plt.scatter(model.predict(X_train_scaled), model.predict(X_train_scaled) - y_train_scaled, c="blue", label="Training Data")
plt.scatter(model.predict(X_test_scaled), model.predict(X_test_scaled) - y_test_scaled, c="orange", label="Testing Data")
plt.legend()
plt.hlines(y=0, xmin=y_test_scaled.min(), xmax=y_test_scaled.max())
plt.title("Residual Plot")
plt.show()
### END SOLUTION

Model Evaluation

To evaluate the model we will use the MSE (Mean Square Error), RMSE (Root Mean Squared Error) and R2

MSE: Measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. The MSE is a measure of the quality of an estimator—it is always non-negative, and values closer to zero are better.

RMSE: Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. RMSE is always non-negative, and a value of 0 (almost never achieved in practice) would indicate a perfect fit to the data. In general, a lower RMSD is better than a higher one. However, comparisons across different types of data would be invalid because the measure is dependent on the scale of the numbers used.

R2: R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. In general, the higher the R-squared, the better the model fits your data.

R-squared as a Biased Estimate R-squared measures the strength of the relationship between the predictors and response. The R-squared in your regression output is a biased estimate based on your sample.

  • An unbiased estimate is one that is just as likely to be too high as it is to be too low, and it is correct on average. If you collect a random sample correctly, the sample mean is an unbiased estimate of the population mean.
  • A biased estimate is systematically too high or low, and so is the average. It’s like a bathroom scale that always indicates you are heavier than you really are. No one wants that!

R-squared is like the broken bathroom scale: it is deceptively large. Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.

This bias is a reason why some practitioners don’t use R-squared at all—it tends to be wrong.

Fortunately, there is a solution and you’re probably already familiar with it: adjusted R-squared. I’ve written about using the adjusted R-squared to compare regression models with a different number of terms.

In [49]:
# Used X_test_scaled, y_test_scaled, and model.predict(X_test_scaled) to calculate MSE and R2

### BEGIN SOLUTION

from sklearn.metrics import mean_squared_error

MSE = mean_squared_error(y_test_scaled, predictions)
r2 = model.score(X_test_scaled, y_test_scaled)
RMSE = np.sqrt(MSE)
### END SOLUTION

print(f"MSE: {MSE}, RMSE: {RMSE}, R2: {r2}")
MSE: 0.8644605079137399, RMSE: 0.9297636839077658, R2: 0.1357641511192248
In [50]:
model.coef_[0]
Out[50]:
array([ 7.76118780e-03, -5.79231674e-03, -2.40291660e-01,  1.05797786e-02,
        1.85257164e-01, -5.08422852e+09, -5.08422852e+09, -4.20746419e+09,
       -9.90963300e+09, -1.25958923e+10, -1.02813893e+10, -7.10063544e+09,
       -6.55987570e+09, -4.96538071e+09,  2.86823501e+10,  3.19318929e+10,
        2.99192453e+10,  2.00970207e+10,  2.80608865e+10,  2.28243303e+10,
        2.22442105e+10,  2.12225081e+10])

Regression Coefficients

In [51]:
coeff_df = pd.DataFrame(model.coef_[0], X_train.columns, columns=['Coefficient'])  
coeff_df
Out[51]:
Coefficient
Occupation 7.761188e-03
Marital_Status -5.792317e-03
Product_Category_1 -2.402917e-01
Product_Category_2 1.057978e-02
Product_Category_3 1.852572e-01
Gender_F -5.084229e+09
Gender_M -5.084229e+09
Age_0-17 -4.207464e+09
Age_18-25 -9.909633e+09
Age_26-35 -1.259589e+10
Age_36-45 -1.028139e+10
Age_46-50 -7.100635e+09
Age_51-55 -6.559876e+09
Age_55+ -4.965381e+09
City_Category_A 2.868235e+10
City_Category_B 3.193189e+10
City_Category_C 2.991925e+10
Stay_In_Current_City_Years_0 2.009702e+10
Stay_In_Current_City_Years_1 2.806089e+10
Stay_In_Current_City_Years_2 2.282433e+10
Stay_In_Current_City_Years_3 2.224421e+10
Stay_In_Current_City_Years_4+ 2.122251e+10
In [52]:
ax = coeff_df.plot(kind='bar', title='Variable Coefficients')
ax.grid(zorder=0)
plt.savefig("../images/regression_coef.png", bbox_inches = "tight")

Coefficients Correlation heatmap

In [53]:
import seaborn as sns
corr = X.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

LASSO Model

https://www.statisticshowto.datasciencecentral.com/lasso-regression/

Lasso (least absolute shrinkage and selection operator) regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited for models showing high levels of muticollinearity or when you want to automate certain parts of model selection, like variable selection/parameter elimination.

Multicollinearity

Multicollinearity generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant information, skewing the results in a regression model. Examples of correlated predictor variables (also called multicollinear predictors) are: a person’s height and weight, age and sales price of a car, or years of education and annual income. An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs of predictor variables. If the correlation coefficient, r, is exactly +1 or -1, this is called perfect multicollinearity. If r is close to or exactly -1 or +1, one of the variables should be removed from the model if at all possible.

It’s more common for multicollineariy to rear its ugly head in observational studies; it’s less common with experimental data. When the condition is present, it can result in unstable and unreliable regression estimates. Several other problems can interfere with analysis of results, including:

  • The t-statistic will generally be very small and coefficient confidence intervals will be very wide. This means that it is harder to reject the null hypothesis.
  • The partial regression coefficient may be an imprecise estimate; standard errors may be very large.
  • Partial regression coefficients may have sign and/or magnitude changes as they pass from sample to sample.
  • Multicollinearity makes it difficult to gauge the effect of independent variables on dependent variables.
In [54]:
# LASSO model
# Note: Use an alpha of .01 when creating the model for this activity
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=.01).fit(X_train_scaled, y_train_scaled)

lasso_predictions = lasso.predict(X_test_scaled)

MSE = mean_squared_error(y_test_scaled, lasso_predictions)
r2 = lasso.score(X_test_scaled, y_test_scaled)
RMSE = np.sqrt(MSE)
### END SOLUTION

print(f"MSE: {MSE}, RMSE: {RMSE}, R2: {r2}")
MSE: 0.8653677199826276, RMSE: 0.9302514283690339, R2: 0.13485717481979587
In [55]:
lasso.coef_
Out[55]:
array([ 7.62125669e-04,  0.00000000e+00, -2.32546402e-01,  2.01768959e-03,
        1.79751440e-01, -3.28277423e-02,  2.08250071e-18, -6.54676112e-03,
       -6.03584446e-03, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        9.27913935e-03,  0.00000000e+00, -7.08723474e-03, -0.00000000e+00,
        4.35880563e-02, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00])
In [56]:
lasso_df = pd.DataFrame(lasso.coef_, X_train.columns, columns=['Coefficient'])  
lasso_df
Out[56]:
Coefficient
Occupation 7.621257e-04
Marital_Status 0.000000e+00
Product_Category_1 -2.325464e-01
Product_Category_2 2.017690e-03
Product_Category_3 1.797514e-01
Gender_F -3.282774e-02
Gender_M 2.082501e-18
Age_0-17 -6.546761e-03
Age_18-25 -6.035844e-03
Age_26-35 -0.000000e+00
Age_36-45 0.000000e+00
Age_46-50 0.000000e+00
Age_51-55 9.279139e-03
Age_55+ 0.000000e+00
City_Category_A -7.087235e-03
City_Category_B -0.000000e+00
City_Category_C 4.358806e-02
Stay_In_Current_City_Years_0 -0.000000e+00
Stay_In_Current_City_Years_1 0.000000e+00
Stay_In_Current_City_Years_2 0.000000e+00
Stay_In_Current_City_Years_3 -0.000000e+00
Stay_In_Current_City_Years_4+ 0.000000e+00
In [57]:
ax = lasso_df.plot(kind='bar', title='Variable Coefficients')
ax.grid(zorder=0)
plt.savefig("../images/lasso_coef.png", bbox_inches = "tight")

Ridge Model

https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. It is hoped that the net effect will be to give estimates that are more reliable.

In [58]:
# Ridge model
# Note: Use an alpha of .01 when creating the model for this activity
from sklearn.linear_model import Ridge

### BEGIN SOLUTION
ridge = Ridge(alpha=.01).fit(X_train_scaled, y_train_scaled)

ridge_predictions = ridge.predict(X_test_scaled)

MSE = mean_squared_error(y_test_scaled, ridge_predictions)
r2 = ridge.score(X_test_scaled, y_test_scaled)
RMSE = np.sqrt(MSE)
### END SOLUTION

print(f"MSE: {MSE}, RMSE: {RMSE}, R2: {r2}")
MSE: 0.8644626684278021, RMSE: 0.929764845768973, R2: 0.13576199116664456
In [59]:
ridge_df = pd.DataFrame(ridge.coef_[0], X_train.columns, columns=['Coefficient'])  
ridge_df
Out[59]:
Coefficient
Occupation 0.007802
Marital_Status -0.005756
Product_Category_1 -0.240265
Product_Category_2 0.010592
Product_Category_3 0.185267
Gender_F -0.020570
Gender_M 0.020570
Age_0-17 -0.017407
Age_18-25 -0.014206
Age_26-35 -0.000695
Age_36-45 0.004848
Age_46-50 0.002383
Age_51-55 0.019038
Age_55+ 0.006267
City_Category_A -0.024754
City_Category_B -0.012805
City_Category_C 0.037397
Stay_In_Current_City_Years_0 -0.002776
Stay_In_Current_City_Years_1 -0.000278
Stay_In_Current_City_Years_2 0.002282
Stay_In_Current_City_Years_3 -0.000605
Stay_In_Current_City_Years_4+ 0.001177
In [60]:
ax = ridge_df.plot(kind='bar', title='Variable Coefficients')
ax.grid(zorder=0)
plt.savefig("../images/ridge_coef.png", bbox_inches = "tight")

ElasticNet Model

https://hackernoon.com/an-introduction-to-ridge-lasso-and-elastic-net-regression-cca60b4b934f

In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.

In [61]:
# ElasticNet model
# Note: Use an alpha of .01 when creating the model for this activity
from sklearn.linear_model import ElasticNet

### BEGIN SOLUTION
elasticnet = ElasticNet(alpha=.01).fit(X_train_scaled, y_train_scaled)

elasticnet_predictions = elasticnet.predict(X_test_scaled)

MSE = mean_squared_error(y_test_scaled, elasticnet_predictions)
r2 = elasticnet.score(X_test_scaled, y_test_scaled)
RMSE = np.sqrt(MSE)
### END SOLUTION

print(f"MSE: {MSE}, RMSE: {RMSE}, R2: {r2}")
MSE: 0.8648014696962177, RMSE: 0.9299470252096179, R2: 0.1354232779470924
In [62]:
elasticnet_df = pd.DataFrame(elasticnet.coef_, X_train.columns, columns=['Coefficient'])  
elasticnet_df
Out[62]:
Coefficient
Occupation 0.004494
Marital_Status -0.000000
Product_Category_1 -0.235325
Product_Category_2 0.006403
Product_Category_3 0.182001
Gender_F -0.020805
Gender_M 0.016150
Age_0-17 -0.011861
Age_18-25 -0.010402
Age_26-35 -0.000000
Age_36-45 0.000137
Age_46-50 0.000000
Age_51-55 0.013057
Age_55+ 0.000942
City_Category_A -0.010457
City_Category_B -0.000000
City_Category_C 0.046565
Stay_In_Current_City_Years_0 -0.000000
Stay_In_Current_City_Years_1 -0.000000
Stay_In_Current_City_Years_2 0.000000
Stay_In_Current_City_Years_3 -0.000000
Stay_In_Current_City_Years_4+ 0.000000
In [63]:
ax = elasticnet_df.plot(kind='bar', title='Variable Coefficients')
ax.grid(zorder=0)
plt.savefig("../images/elasticnet_coef.png", bbox_inches = "tight")

Decision Tree Regressor Model

https://gdcoder.com/decision-tree-regressor-explained-in-depth/

Decision tree regression observes features of an object and trains a model in the structure of a tree to predict data in the future to produce meaningful continuous output. Continuous output means that the output/result is not discrete, i.e., it is not represented just by a discrete, known set of numbers or values.

Decision tree regressor uses MSE and similar metrics to determine splits.

In [64]:
# Read both train and test data
train = pd.read_csv("../../data/trainCLEAN.csv")
testFinal = pd.read_csv("../../data/testCLEAN.csv")
In [66]:
X = train[['Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3']]
y = train['Purchase'].values.reshape(-1, 1)

Train Test Split Data

In [67]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

Model Predictions & Evaluation

To evaluate the model we will use the MSE (Mean Square Error), RMSE (Root Mean Squared Error) and R2

MSE: Measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. The MSE is a measure of the quality of an estimator—it is always non-negative, and values closer to zero are better.

RMSE: Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. RMSE is always non-negative, and a value of 0 (almost never achieved in practice) would indicate a perfect fit to the data. In general, a lower RMSD is better than a higher one. However, comparisons across different types of data would be invalid because the measure is dependent on the scale of the numbers used.

Regressor_Score: Returns the coefficient of determination R^2 of the prediction.

R2: R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. In general, the higher the R-squared, the better the model fits your data.

R-squared as a Biased Estimate R-squared measures the strength of the relationship between the predictors and response. The R-squared in your regression output is a biased estimate based on your sample.

  • An unbiased estimate is one that is just as likely to be too high as it is to be too low, and it is correct on average. If you collect a random sample correctly, the sample mean is an unbiased estimate of the population mean.
  • A biased estimate is systematically too high or low, and so is the average. It’s like a bathroom scale that always indicates you are heavier than you really are. No one wants that!

R-squared is like the broken bathroom scale: it is deceptively large. Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.

This bias is a reason why some practitioners don’t use R-squared at all—it tends to be wrong.

Fortunately, there is a solution and you’re probably already familiar with it: adjusted R-squared. I’ve written about using the adjusted R-squared to compare regression models with a different number of terms.

In [68]:
# https://stackoverflow.com/questions/46139186/interpreting-the-decisiontreeregressor-score
from sklearn.tree import DecisionTreeRegressor  
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import metrics

regressor_model = DecisionTreeRegressor(max_depth=8, min_samples_leaf=150)
regressor_model = regressor_model.fit(X_train, y_train)
regressor_predictions = regressor_model.predict(X_test)
regressor_score = regressor_model.score(X_train,y_train)

print('Purchase Predictions:', regressor_predictions)
print('MAE:', metrics.mean_absolute_error(y_test, regressor_predictions))
print('MSE:', metrics.mean_squared_error(y_test, regressor_predictions))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, regressor_predictions)))
print(f"Model Accuracy (score):{regressor_score*100}%")
# print('Test Variance score: %.2f' % r2_score(y_test, regressor_predictions))
Purchase Predictions: [ 7674.25436568  7674.25436568 11570.13770492 ...  6146.54627034
  6146.54627034  7359.61284184]
MAE: 2260.588130797274
MSE: 8957329.239800084
RMSE 2992.879756989927
Model Accuracy (score):64.25439064054648%

Coefficient of Determination

In [69]:
from sklearn.model_selection import cross_val_score
R2_coefficient_determination = cross_val_score(regressor_model, X_train, y_train, cv=10)
R2_coefficient_determination

print('Coefficient of Determination (a.k.a R2 in LR):', R2_coefficient_determination)
Coefficient of Determination (a.k.a R2 in LR): [0.63650802 0.64540879 0.63940972 0.63913948 0.64333521 0.64147487
 0.63732935 0.64210934 0.64829829 0.64265044]
In [70]:
# DecisionTree in sklearn will automatically calculate feature importance
importances = regressor_model.feature_importances_
importances
Out[70]:
array([5.75593739e-05, 6.49270147e-04, 4.14766236e-04, 4.26173594e-03,
       3.55314135e-04, 7.06281692e-05, 9.77697366e-01, 8.53728874e-03,
       7.95607108e-03])

Model Coefficients

In [71]:
# We can sort the features by order of importance
regressor_coef = sorted(zip(regressor_model.feature_importances_, train.columns), reverse=True)
regressor_coef = pd.DataFrame(regressor_coef)
regressor_coef.columns = ['Coefficients', 'Features']

# Save to csv and HTML
regressor_coef.to_csv('../../data/regressor_coef.html')
regressor_coef.to_html('../../data/regressor_coef.html')
regressor_coef
Out[71]:
Coefficients Features
0 0.977697 Product_Category_1
1 0.008537 Product_Category_2
2 0.007956 Product_Category_3
3 0.004262 City_Category
4 0.000649 Age
5 0.000415 Occupation
6 0.000355 Stay_In_Current_City_Years
7 0.000071 Marital_Status
8 0.000058 Gender
In [72]:
from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(regressor_model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
C:\Users\pablo\Anaconda3\envs\PythonData\lib\site-packages\sklearn\externals\six.py:31: DeprecationWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).
  "(https://pypi.org/project/six/).", DeprecationWarning)
Out[72]:

Residual Plot

In [92]:
# https://acadgild.com/blog/using-decision-trees-for-regression-problems
fig, ax = plt.subplots()
ax.scatter(y_test, regressor_predictions, edgecolors='white')
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')

plt.show()
plt.savefig("../images/regressor_plot.png", bbox_inches = "tight")
<Figure size 432x288 with 0 Axes>

Predictions Comparison

In [74]:
# source: https://stackoverflow.com/questions/42049147/convert-list-to-pandas-dataframe-column
predicted = pd.DataFrame(np.array([regressor_predictions]).T)
actual = pd.DataFrame(y_test)
results = pd.merge(actual, predicted, left_index=True, right_index=True)
results.rename(columns = {'0_x':'Actual_Purchase', '0_y': 'Predicted_Purchase'}, inplace = True) 
results
Out[74]:
Actual_Purchase Predicted_Purchase
0 8091 7674.254366
1 6114 7674.254366
2 12714 11570.137705
3 11799 12922.596573
4 3482 6146.546270
... ... ...
109178 3562 6146.546270
109179 3538 2519.900466
109180 6897 6146.546270
109181 5452 6146.546270
109182 7779 7359.612842

109183 rows × 2 columns

Predict Original Test File

In [75]:
# Let's predict the original test file 
test_predict = regressor_model.predict(testFinal)
test_predict
Out[75]:
array([13041.11003348, 10180.08048037,  6146.54627034, ...,
       12922.59657327, 20165.58522727,  2179.18763368])
In [76]:
# Put output in a dataframe

test_predict_output = pd.DataFrame(np.array([test_predict]).T)
test_predict_output['Test_Purchase'] = test_predict_output 

test_predict_output.drop(test_predict_output.columns[0], inplace = True, axis =1)
test_predict_output
Out[76]:
Test_Purchase
0 13041.110033
1 10180.080480
2 6146.546270
3 2179.187634
4 2519.900466
... ...
233594 7359.612842
233595 6146.546270
233596 12922.596573
233597 20165.585227
233598 2179.187634

233599 rows × 1 columns

Test Table results

In [78]:
# Create Table of Test file with results 
table = pd.merge(test[['User_ID', 'Product_ID']], testFinal, left_index = True, right_index = True)
testTableFinal = pd.merge(table, test_predict_output, left_index = True, right_index = True)


#Save to CSV
testTableFinal.to_csv("../../data/testTableFinalCLEAN.csv", index=False, encoding='utf8')

# Save to HTML
testTableFinal.to_html('../../data/testTableFinalCLEAN.html')
testTableFinal

# Save to Json

testTableFinal.to_json("../../data/testTableFinalCLEAN.json", orient='records')
testTableFinal
Out[78]:
User_ID Product_ID Gender Age Occupation City_Category Stay_In_Current_City_Years Marital_Status Product_Category_1 Product_Category_2 Product_Category_3 Test_Purchase
0 1000004 P00128942 1 4 7 1 2 1 1 11.0 0.0 13041.110033
1 1000009 P00113442 1 2 17 2 0 0 3 5.0 0.0 10180.080480
2 1000010 P00288442 0 3 1 1 4 1 5 14.0 0.0 6146.546270
3 1000010 P00145342 0 3 1 1 4 1 4 9.0 0.0 2179.187634
4 1000011 P00053842 0 2 1 2 1 0 4 5.0 12.0 2519.900466
... ... ... ... ... ... ... ... ... ... ... ... ...
233594 1006036 P00118942 0 2 15 1 4 1 8 0.0 0.0 7359.612842
233595 1006036 P00254642 0 2 15 1 4 1 5 8.0 0.0 6146.546270
233596 1006036 P00031842 0 2 15 1 4 1 1 5.0 12.0 12922.596573
233597 1006037 P00124742 0 4 1 2 4 0 10 16.0 0.0 20165.585227
233598 1006039 P00316642 0 4 0 1 4 1 4 5.0 0.0 2179.187634

233599 rows × 12 columns

In [80]:
# Save 100 records to a table

testTableFinal1 = testTableFinal[:50]  # same as df.head(50)
testTableFinal2 = testTableFinal[-50:] # same as df.tail(20)
testTableFinalAppend = testTableFinal1.append(testTableFinal2, ignore_index = True)
testTableFinalAppend

#Save to CSV
testTableFinalAppend.to_csv("../../data/testTableFinalCLEAN100.csv", index=False, encoding='utf8')

# Save to HTML
testTableFinalAppend.to_html('../../data/testTableFinalCLEAN100.html')
testTableFinal

# Save to Json

testTableFinalAppend.to_json("../../data/testTableFinalCLEAN100.json", orient='records')
testTableFinalAppend
Out[80]:
User_ID Product_ID Gender Age Occupation City_Category Stay_In_Current_City_Years Marital_Status Product_Category_1 Product_Category_2 Product_Category_3 Test_Purchase
0 1000004 P00128942 1 4 7 1 2 1 1 11.0 0.0 13041.110033
1 1000009 P00113442 1 2 17 2 0 0 3 5.0 0.0 10180.080480
2 1000010 P00288442 0 3 1 1 4 1 5 14.0 0.0 6146.546270
3 1000010 P00145342 0 3 1 1 4 1 4 9.0 0.0 2179.187634
4 1000011 P00053842 0 2 1 2 1 0 4 5.0 12.0 2519.900466
... ... ... ... ... ... ... ... ... ... ... ... ...
95 1006036 P00118942 0 2 15 1 4 1 8 0.0 0.0 7359.612842
96 1006036 P00254642 0 2 15 1 4 1 5 8.0 0.0 6146.546270
97 1006036 P00031842 0 2 15 1 4 1 1 5.0 12.0 12922.596573
98 1006037 P00124742 0 4 1 2 4 0 10 16.0 0.0 20165.585227
99 1006039 P00316642 0 4 0 1 4 1 4 5.0 0.0 2179.187634

100 rows × 12 columns

In [81]:
# Create table to input in competition
test_predict_output = pd.merge(test[['User_ID', 'Product_ID']], test_predict_output, left_index = True, right_index = True)
test_predict_output
Out[81]:
User_ID Product_ID Test_Purchase
0 1000004 P00128942 13041.110033
1 1000009 P00113442 10180.080480
2 1000010 P00288442 6146.546270
3 1000010 P00145342 2179.187634
4 1000011 P00053842 2519.900466
... ... ... ...
233594 1006036 P00118942 7359.612842
233595 1006036 P00254642 6146.546270
233596 1006036 P00031842 12922.596573
233597 1006037 P00124742 20165.585227
233598 1006039 P00316642 2179.187634

233599 rows × 3 columns

In [82]:
test_predict_output.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233599 entries, 0 to 233598
Data columns (total 3 columns):
User_ID          233599 non-null int64
Product_ID       233599 non-null object
Test_Purchase    233599 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 5.3+ MB
In [83]:
test_predict_output['UserID_and_ProductID'] = test_predict_output["User_ID"].map(str) + test_predict_output["Product_ID"].map(str)
test_predict_output
Out[83]:
User_ID Product_ID Test_Purchase UserID_and_ProductID
0 1000004 P00128942 13041.110033 1000004P00128942
1 1000009 P00113442 10180.080480 1000009P00113442
2 1000010 P00288442 6146.546270 1000010P00288442
3 1000010 P00145342 2179.187634 1000010P00145342
4 1000011 P00053842 2519.900466 1000011P00053842
... ... ... ... ...
233594 1006036 P00118942 7359.612842 1006036P00118942
233595 1006036 P00254642 6146.546270 1006036P00254642
233596 1006036 P00031842 12922.596573 1006036P00031842
233597 1006037 P00124742 20165.585227 1006037P00124742
233598 1006039 P00316642 2179.187634 1006039P00316642

233599 rows × 4 columns

In [84]:
# Lets drop User ID and Product ID
drop_columns = ['User_ID', 'Product_ID']
test_predict_output.drop(drop_columns, inplace = True, axis =1)
test_predict_output
Out[84]:
Test_Purchase UserID_and_ProductID
0 13041.110033 1000004P00128942
1 10180.080480 1000009P00113442
2 6146.546270 1000010P00288442
3 2179.187634 1000010P00145342
4 2519.900466 1000011P00053842
... ... ...
233594 7359.612842 1006036P00118942
233595 6146.546270 1006036P00254642
233596 12922.596573 1006036P00031842
233597 20165.585227 1006037P00124742
233598 2179.187634 1006039P00316642

233599 rows × 2 columns

In [85]:
# Combine UserID and ProductID into one column and export to csv
test_predict_output = test_predict_output[['UserID_and_ProductID', 'Test_Purchase']]

#Save to CSV

test_predict_output.to_csv("../../data/testOutputCLEAN.csv", index=False, encoding='utf8')

test_predict_output
Out[85]:
UserID_and_ProductID Test_Purchase
0 1000004P00128942 13041.110033
1 1000009P00113442 10180.080480
2 1000010P00288442 6146.546270
3 1000010P00145342 2179.187634
4 1000011P00053842 2519.900466
... ... ...
233594 1006036P00118942 7359.612842
233595 1006036P00254642 6146.546270
233596 1006036P00031842 12922.596573
233597 1006037P00124742 20165.585227
233598 1006039P00316642 2179.187634

233599 rows × 2 columns